DRAFT - Protocols for de Novo whole genome assembly, annotation, and identification of strains (v2022.0.0)

Assembly, annotation, sppID, and characterization

1 Protocol Notes

Note: This is an HTML document and is best viewed in a web browser.

We are working with endophytes, microorganisms that colonize the internal tissues of plants. We have isolated many strains of these endophytes from native, wild plants and are studying them for their ability to improve host-plant tolerance to different environmental stresses. We perform experiments to qualify and quantify a range of potentially beneficial traits using empirical methods. We also test whether the endophytes can be introduced to a novel host, and whether the new host will benefit - which is measured by improved growth/productivity in the host under environmental stress.

Strain vs. Species

Reviews on endophytes: (Reinhold-Hurek and Hurek 2011; Hardoim et al. 2015)

Because phenotypes (observed traits) are explained by genotypes (genetic traits), our investigations would benefit from having the DNA sequences of the endophytes. This information can inform both our applied and basic science experiments.

Accordingly, we have sequenced several of our endophytes and received the raw sequence reads as data files. We will use those files to assemble, annotate, and identify these strains. In subsequent protocols we will mine the genomes for genes and pathways of interest.

Whole genome sequence (WGS) assembly is the process whereby raw sequence reads are assembled into a draft genome. Assemblies are typically completed either by read mapping, which uses a high-quality reference genome of a closely related species as a guide to assemble the draft genome, or de Novo, where the draft genome is assembled using only the information contained in the sequence reads themselves.

The best analogy for a de Novo assembly that I have seen is that it is like putting a book back together after it has been through a shredder - without having an original copy of the un-shredded book. Both approaches have their pros and cons, but we will be performing de Novo assemblies: we are working with the genomes of environmental strains, which are less well curated than the genomes of clinical strains, and de Novo assemblies carry less reference bias.

Recommended literature:

These are introductory papers for the WGS process. Associated papers for each step are included within that section.

This document is for de Novo WGS using Illumina paired-end reads, and contains our current protocols for assembly, quality control, annotation, and strain identification. For each step there are many alternative protocols available. This document contains only the protocols that we are currently using.

To maintain reproducible research, it is important to record the software version and the date used for each step in the protocol. Updates and new versions can alter analyses and output, and may include bugs. If re-analysis is required, use the same version of software from the original analyses where possible.

All relevant literature should be cited in the appropriate section and a reference list included at the end of the document. These documents can be found in the Doty Lab protocols notebook and/or the Doty Lab MS Teams Literature folder. For ease of use, it is recommended to import copies of the articles into your literature management software, e.g., Zotero.

1.1 How these protocols are structured:

Each section includes background notes, the required steps, and concludes with examples (blocked out) used from a prior experiment. The background notes and examples don’t need to be included in your notebook. The steps listed are the standard methods/activities and any deviations or revisions should be thoroughly noted.

Screen captures of the key steps are included for clarity and also don’t need to be included in your notebooks.

The code chunks are printed in this document (echo=T) as part of the protocol instructions, but would normally be hidden in the html document (echo= F).

These protocols assume that you are already somewhat familiar with basic computer commands, R Studio/Markdown, and our Doty Lab Digital Notebook formatting and style guides for file structure, naming, coding, etc.

It also assumes that you have installed and configured Geneious Prime (GP) per the instructions and completed the introductory tutorials.

1.2 Basic pipeline for assembly, annotation, and species identification.

  1. Create project/experiment directories and R Markdown notebook as needed.
  2. Obtain the raw seq. read files, either downloaded from sequencing lab or provided by a colleague.
  3. Import raw seq. reads into Geneious Prime (GP)
  4. Trim and Normalize raw seq. reads in GP
  5. Perform assembly in GP
  6. Filter contig sequences to \(\ge\) 1000 bp
  7. Export contig files and review assembly quality using CheckM
  8. Import contig files into RAST for annotation
  9. Download annotation files and import into GP
  10. Import contig files into TYGS for type-strain whole genome identification
  11. Summarize findings

You will need to start the document with the formatted YAML template and the setup code chunk. Neither of these will appear on the final document, but the YAML contains formatting and output instructions and the code chunk loads the necessary R packages to format data and tables.

Both of these topics are covered in the Digital Notebook protocol, and a template containing both the YAML and the setup code chunk can be found in our MS Teams Bioinformatic Protocols/scripts folder.

1.3 Basic Section Layout

Per our lab Digital Notebook protocols, each step should include the relevant information for each application in this order:

  1. Information about local software or online platform used
  2. Methods: What was done
  3. Results: Data produced
  4. Brief summary of results
  5. The submitted and output directory/files pathway

In detail:

  1. The information about the software/platform should include:
  • Name, version, and literature references. 

    • In the case of online platforms, include website link, date accessed, and whether data is stored on their servers or downloaded (or both). In the case for online data, note approximately how long it will be maintained on the platform server.
  2. Methods: A brief description of what was done for each step. As with all methods, it should contain enough detail that another researcher could reproduce your steps. This could include:
  • Type of files used (submitted), including directory and file names
  • Software settings and/or options
  • Deviations or revisions from standard protocols 
  3. Results: Typically a simple data table of the relevant output, or a description of the output.

  4. Summary of findings: A brief summary of the results, providing sufficient information to write formal methods and results for publication. Note any interesting, unusual, or unexpected findings.

  5. Document the directory pathway (where files originated or were stored) and the file name format for the files. Note: No need to list all of the file names, just the abbreviated (wildcard) format.

Note: Figures and summaries do not need to be publication quality. The information just needs to be accurate and clearly presented, with sufficient detail for other researchers to produce formal output for publication.


2 de Novo assembly, annotation, and identification protocols

2.1 Project/Experiment Information

2.1.1 Introduction

A short introduction for the experiments in this notebook, referencing the project and giving enough relevant detail that another reader familiar with the project can understand what the notebook contains.

2.1.2 Collaborators

Names and contact information of collaborators (as needed). For smaller projects this will be the lead researcher in our lab; larger projects may include external collaborators.

Robert Tournay, Doty Lab UW,

2.1.3 Summary of findings

This is the last step. When all analyses are completed you will write up the summary of findings in this section, with the Tables linked.

2.1.4 Location of data files.

Identify the location of all associated files. These should include the file pathways for:

  • Local project files stored on your UW Google Cloud storage.
  • The GitHub repository in the Doty Lab UW github account.
  • Location of the raw data sequence files. These are archived on a UW Google account and are not included in the GitHub repository. Copies of the files can be saved in the local project folder for import to Geneious Prime for assembly.

The GitHub link can be found by going to the Project Repository, clicking the green Code button and copying the link.

If you pull a copy of the raw sequence read files into your local repository you will need to add it to the .gitignore file due to space limitations in GitHub.
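For example, assuming the copied reads sit in a raw_data/ sub-directory of the repository (adjust the paths to your own layout), the .gitignore entries would look like:

```
# Raw sequence reads are archived on the UW Google account;
# local copies are too large for GitHub and must not be committed.
raw_data/
*.fastq.gz
```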

Example:

Local project files: ~/Projects/Proj-DOE
Github: https://github.com/Doty-Lab-UW/Proj-DOE.git
Raw data: ~/Bioinformatics/raw_data/DOE

2.1.5 DNA Isolation and Sequencing

Notes on the extraction of gDNA and library prep as appropriate. Because these steps are field and bench experiments (documented elsewhere), the detail here can be minimal. If these steps were completed by other researchers, that can be noted.

Example:

Collection, isolation, gDNA prep by Doty/Sher. Collection and isolation per Doty Lab protocols. DNA extractions using Nextera DNA Flex kit (Illumina, Cat. No. 20018705).

Information of lab that did the sequencing.

  • Include name of lab (link to website)
  • Kit names and IDs
  • Sequencing lab’s project numbers
  • how and when the raw data were obtained

Example:

Library construction, amplification, and sequencing by GENEWIZ, Inc.

  • Reference Project #30_459016717
  • MiSeq platform (300-Cycle V2 Reagent Kit)
  • Sequences (fastq.gz) downloaded from Genewiz on 2021-06-17

2.1.6 Strains used in this study

Include list of strains assembled. Often, these have been identified using 16S rRNA. The 16S rRNA method is typically reliable to Genus, but revisions may occur when the identification is completed using the whole-genome sequence alignments here. The 16S rRNA method is less reliable for identification to the species level, although it should be included on the table when provided.

For one or a few strains, a bullet list will do. For longer lists, build a table. Tables can be constructed in R Markdown or, more commonly, imported from Excel or another document (.csv, .tsv, or similar simple format).

A table built in R Markdown:

Strain     Genus             Species (if provided)
WW5        Sphingomonas      sp.
SherDot1   Azotobacter       beijerinckii
Root10     Rahnella          aquatilis
NS         Paraburkholderia  sp.

Alternatively, a table can be built quickly using an external document and the kableExtra and tidyverse packages.

###############
# There are many ways to import data and create tables in R Studio. 
# These tables do not need to be "publication" quality, and so you don't need
# to spend a lot of time formatting. Just make them readable.  
#
# I created a /data sub-directory in the /rmd directory and imported an 
# excel .csv file containing the strain data.  
#
# Recall that normally, code chunks are not included in report outputs. Here, I
# used the option echo=T to print the code chunk, but normally the call would be
# echo=F, to hide the code on the output document.  
# 
# Also, note that the "kableExtra" and "tidyverse" packages were previously
# loaded into the R environment.  
# 
# Below here is what I would actually include in my code chunk.  
###############

# This command imports the data into R and saves it as the variable strains
strains <- read.csv("data/DOE_16S_sppID.csv")

# This command selects only the columns containing Strain and scientific name
strains <- strains %>%
  select(Strain, Spp_16S_rRNA)

# Code for creating a captioned table, with minor formatting
kbl(strains, caption = "Strains included in this study") %>%
  kable_styling(full_width = F) %>%         # full_width=F for narrow tables
  column_spec(2, italic = TRUE)             # italicize scientific names (col 2)
Table 2.1: Strains included in this study
Strain     Spp_16S_rRNA
WW5        Sphingomonas sp.
SherDot1   Azotobacter beijerinckii
Root 10    Rahnella aquatilis
NS         Paraburkholderia sp.

2.2 de Novo Assemblies using Geneious Prime

There are many methods (both CLI and GUI) for de Novo assemblies. This protocol uses the Geneious Prime (GP) software, a GUI platform that can perform various assembly methods, e.g., SPAdes and Velvet, in addition to its own. There are strengths and weaknesses to each of the assembly methods (Forouzan et al. 2017). I have found that both GP and SPAdes provide high quality assemblies with good coverage. These protocols will describe the GP assembler, but it is good to know that there are other options available.

The general de novo assembly pipeline in GP is:

  1. Import paired-end raw seq. reads files into named folder on GP
  2. Trim the reads using BBDuk plug-in 
  3. Normalize the trimmed read files using BBNorm plug-in
  4. Assemble the trimmed & normalized read files using the GP assembler
  5. Record the assembly data in a .csv file and create a table of relevant data 
  6. Extract contigs \(\ge\) 1000 bp into a <strain>-ConSeq.fasta file for annotation
  7. Summarize results
  8. Export <strain>-ConSeq.fasta into the local project directory/folder and update version control on GitHub

In detail:

  1. Import paired-end raw seq. reads files into named folder on GP

This protocol assumes that you have already installed Geneious Prime, and have taken the set-up tutorials. You should already have the plug-ins installed, know how to create project directories and folders, and how to import Illumina paired-end raw seq. read files.

Your GP should look something like this:

Thumbnail: click to enlarge

  2. Trim the reads using BBDuk plug-in

The data files contain millions of short (~250 bp) sequence reads that we will assemble into a draft genome. Due to the high number of reads, there is \(\gt\) 30x sequence coverage for each nucleotide. Each nucleotide on every read has a quality score (phred) that represents the certainty that that nucleotide is the correct one - and not the result of PCR or sequencing errors.
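The phred scale is logarithmic: a score Q corresponds to an error probability of 10^(-Q/10), so Q30 means a 1-in-1000 chance that the base call is wrong. A quick sketch of the conversion (shown in Python for brevity; not part of the pipeline itself):

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a phred quality score Q to the probability the base call is wrong."""
    return 10 ** (-q / 10)

# Q30 corresponds to an error probability of 0.001, i.e., 99.9% accuracy
print(phred_to_error_prob(30))
```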

BBDuk trims the seq. read files based on a minimum phred score and a minimum read length. These parameters are set by the user. To trim the reads, select your strain file in GP and then Annotate and Predict > Trim using bbDuk. A pop-up window will open; record the BBDuk version and complete the boxes and options like this:

  • Trim adapters: unchecked
  • Trim Low Quality: checked, Trim: Both Ends, and Minimum Quality 30
  • Trim adapters based on paired read overhangs: Checked, Minimum Overlap: 20
  • Discard Short Reads: checked, Minimum Length: 20 bp
  • Trim Low Complexity (Entropy): unchecked
  • Keep original order (slower, but ensures…): checked
  • Max. Memory: default
  • Custom BBDuk options: blank

The bolded parameters are reported and should be recorded in your notes.
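To make the settings above concrete, the trimming rule amounts to: walk in from both ends discarding bases below Q30, then drop any read shorter than 20 bp. A minimal illustrative sketch (Python; BBDuk's actual algorithm is more sophisticated):

```python
def trim_read(seq, quals, min_q=30, min_len=20):
    """Trim bases below min_q from both ends; return None if the survivor is too short."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1                       # trim low-quality bases from the 5' end
    while end > start and quals[end - 1] < min_q:
        end -= 1                         # trim low-quality bases from the 3' end
    trimmed = seq[start:end]
    return trimmed if len(trimmed) >= min_len else None  # discard short reads
```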

Thumbnail: click to enlarge

This step should take a few minutes, depending on CPU power and available RAM.

GP creates a new file, adding (trimmed) to your strain name; the un-trimmed file is retained for comparison and reproducibility.

  3. In a second quality control step, the trimmed read file will be normalized for depth of coverage. As noted above, each nucleotide in the genome has been redundantly sequenced many times. The number of times each nucleotide position is sequenced is called the Depth of Coverage, or DOC. Most NGS technologies will target an average DOC of 30x - 90x. However, not all nucleotides were sequenced the same number of times. Some sequences are over-represented, and past a certain point the extra data doesn’t improve results, but it does increase computation time.
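Conceptually, normalization subsamples reads from over-represented regions down toward the target depth and drops reads below the minimum depth as likely errors. A hedged sketch of that decision rule (Python; BBNorm itself estimates depth from k-mer counts):

```python
import random

def keep_read(depth, target=40, min_depth=6, rng=random.random):
    """Decide whether to keep a read given its estimated depth of coverage.
    Reads below min_depth are dropped as likely errors; reads above target
    are kept with probability target/depth to flatten the coverage."""
    if depth < min_depth:
        return False
    if depth <= target:
        return True
    return rng() < target / depth
```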

The BBNorm plug-in normalizes the DOC based on user set parameters. Select the strain (trimmed) file, and then Sequence > Error Correct and Normalize Reads.... A pop-up window will open, record the BBNorm version and complete the boxes and options like this:

  • Error correction: unchecked
  • Normalization: checked, Target Coverage Level: 40, Minimum depth: 6

The bolded parameters are reported and should be recorded in your notes.

Thumbnail: click to enlarge

This step should take a few minutes, depending on CPU power and available RAM.

GP creates a new file adding (trimmed)(normalized) to your strain name. Again, previous files are retained for comparison and reproducibility.

Note: If BBDuk and BBNorm make you feel like you are throwing away data, it is because you are. However, due to the high-quality sequencing and DOC of modern NGS technologies, you are eliminating noise that will increase your assembly time, and may reduce your downstream results in the long run. If you want to convince yourself of this you can run the assembly and compare downstream applications on all three versions.

  4. The de Novo assembler programs use complex algorithms, e.g., (Rizzi et al. 2019), to assemble the raw seq. reads (~250 bp) into contiguous sequences, called contigs, ranging from hundreds of thousands of bp down to hundreds of bp in length, each contig representing a portion of the genome. Accordingly, draft genome assemblies are composed of a set of contigs rather than one continuous sequence representing the entire genome. However, this is sufficient for most downstream applications.

The quality of the draft assembly is affected by a range of factors, including the quality of upstream processes (e.g., DNA extraction, PCR amplification and sequencing) and the complexity of the organism’s genome. Accordingly, the GP assembler will produce a report that contains information on the quality of the assembly.

Select the <name of strain>(trimmed)(normalized) file and then Align/Assemble > De Novo Assemble... to open up the parameters window. Complete as follows:

  • Assembler: Geneious

  • Sensitivity: Medium-Low Sensitivity/Fast

  • Trim Before Assembly: Do not trim

  • Results: 

    • Assembly Name: default
    • Save assembly report: checked
    • Save list of unused reads: unchecked
    • Save in sub-folder: checked
    • Save contigs (checked Maximum 1000)
    • Save Consensus Sequences: checked, default options
  • Circularize contigs of \(\ge\) 3 sequences, if ends match: True

  • Produce Scaffolds: True

Thumbnail: click to enlarge

This is the assembly step and should take significantly longer (hours) depending on the CPU power, available RAM, and quality/complexity of the read files. You can reduce the time by only assembling a single strain at a time and closing other applications.

This step creates a sub-directory in GP called <name of strain>(trimmed)(normalized) Assembly that contains the assembly report (txt document), a consensus file, and the contig files. Depending on the assembly, there could be dozens or hundreds of contigs, although most will be short, low coverage contigs that can be filtered out without a loss of quality.

  5. In the Assembly directory, click on the Assembly Report file and record the following information, in table format, in an Excel spreadsheet or text document.
  • Strain and Genus

  • From the Contigs >= 1000 bp column:

    • Length Sum (bp)
    • Number of contigs
    • Max Length (bp)
    • N50 length (bp)
  • Sort the contig files by length, and record:

    • The %GC and Mean Coverage (DOC) of the longest contig.
  • Sort the Topology by type and record the quantity (n) and length (bp) of circular sequences \(\ge\) 1000 bp. These are GP assembler predicted plasmids. Some assemblies may not have any, others may have several.

Save the table as a .csv file, and place it in the ~/rmd/data folder in your local project folder.

This image shows how the data look in Excel and in an R Studio text file.

Thumbnail: click to enlarge

And an example of the Table as shown in the report:

#This creates a variable to store the caption text, keeps code below cleaner
gp_cap <- "Geneious Prime Assembly Statistics"

# This command imports the data into R and saves it as the variable strains
gpStats <- read.csv("data/GP-stats.csv")

# Creates formatted table from data 
kbl(gpStats, caption = gp_cap) %>%
  kable_styling() %>%
  column_spec(2, italic=T)
Table 2.2: Geneious Prime Assembly Statistics
Strain    Genus             Genome_Mbp  Contigs  Max_Len  N50     pctGC  meanDOC  Plasmids  P1     P2
WW5       Sphingomonas      5.76        118      252688   88302   64.00  40       FALSE     NA     NA
SherDot1  Azotobacter       4.93        154      298403   75606   65.90  50       TRUE      77262  NA
Root 10   Rahnella          5.71        46       870763   74700   52.11  na       UNK       NA     NA
NS        Paraburkholderia  8.05        84       853358   169632  62.50  38       TRUE      37673  13240

This table covers the basic QC parameters for genomic assemblies. Identifying good assemblies comes with practice, but generally you want to see:

  • Are the genome size and %GC reasonably comparable to other strains in the same Genus?

    • A preliminary review can be done with a quick internet search for “<Genus> AND genome size” or “<Genus> AND %GC” to answer the question: Are the results reasonable?
    • For example, notice with this data set that the Paraburkholderia genome is large relative to the other strains. A quick internet search will confirm that a genome of ~8 Mbp is reasonable for this Genus. The same process can be followed with %GC. A more formal investigation can be done using the literature.
  • Fewer, longer contigs are considered more complete assemblies, whereas many shorter contigs could indicate assembly issues or noise in the data set. You can also look at the longest contigs - to confirm whether they contain most of the sequence data.

  • N50 is the contig length at which the cumulative length of contigs that long or longer reaches half the total assembly; longer is better.

  • DOC for the longest contigs should be better than 30x coverage, and higher is better. If you look at the contig files, you will notice the DOC drops significantly on the shorter contigs.

These are rough guides to the assembly quality; what you are looking for are anomalies that stand out: an unreasonable genome size or %GC for the genus, a high number of short contigs, a short N50, etc. In the next step, we will do further QC.
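The report statistics above can all be derived from the list of contig lengths; a sketch of how Length Sum, contig count, Max Length, and N50 relate (Python, illustrative only):

```python
def assembly_stats(lengths, min_len=1000):
    """Summary statistics over contigs >= min_len bp."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    running, n50 = 0, 0
    for n in kept:                   # walk longest-first until half the assembly is covered
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return {"length_sum": total, "n_contigs": len(kept),
            "max_length": kept[0] if kept else 0, "n50": n50}
```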

  6. With that in mind, the next step is to extract the contigs \(\ge\) 1000 bp into a separate file, effectively filtering out the short contigs. The 1000 bp cut-off is arbitrary, and some protocols retain contigs \(\ge\) 500 bp. Recall that there is redundancy in the data (DOC) and that the shorter contigs are lower quality and noisier - as reflected in their DOC and %GC - so this step filters out that noise.

To filter the contigs, select the Consensus Sequence file, then the Lengths tab, and then the Extract button. This will open a pop-up window where you can set Extract sequences with minimum length to 1000 bp. Make no other changes.

This will create a new Consensus Sequence file containing only the contigs that are \(\ge\) 1000 bp. Rename this to <strain>-ConSeq. This is the working draft assembly file.
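The same \(\ge\) 1000 bp filter can be reproduced outside GP on any FASTA file; a minimal sketch with no external dependencies (Python, illustrative only):

```python
def filter_fasta(text, min_len=1000):
    """Keep only FASTA records whose sequence is at least min_len bp."""
    records = []
    for chunk in text.strip().split(">"):
        if not chunk:
            continue                     # skip the empty piece before the first '>'
        header, _, seq = chunk.partition("\n")
        seq = seq.replace("\n", "")      # join wrapped sequence lines
        if len(seq) >= min_len:
            records.append(">" + header + "\n" + seq)
    return "\n".join(records)
```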

  7. Now write a brief summary of the results, stating your interpretation of the quality of the assemblies, whether they seem reasonable, and noting any problems, exceptions, or deviations from the standard protocols. It should contain enough detail that your collaborators (or you, six months from now) can quickly scan the summary and interpret the results.

If there were no issues during assembly, this write up should be very brief; and will encompass only a sentence or two when submitted for publication.

  8. The final step is to export <strain>-ConSeq.fasta into the local project directory/folder and update version control on GitHub

Select the filtered file in GP, then Export > Export Documents and save the file to your Project folder under /output/GP-assembly/<strain>-ConSeq.fasta and then document the directories and files.

Example GP Assembly notes from another project:

Geneious Prime (v2021.2.2)
BBDuk and BBNorm (v38.84)
R version 4.0.2 (2020-06-22) – “Taking Off Again”

  1. Paired end read files were imported and combined into a single file.

  2. Reads were trimmed with BBDuk, with the minimum phred score set to 30 and reads < 20 bp discarded.

  3. The trimmed read files were then normalized for coverage, with the target depth of coverage set to 40 and the minimum set to 6.

  4. Trimmed and normalized read files were then assembled using the Geneious Prime assembler with the following settings:

    • Assembler: Geneious
    • Sensitivity: Medium-Low Sensitivity/Fast
    • Trim Before Assembly: No
    • Save Consensus Sequences: TRUE
    • Circularize contigs of => 3 sequences, if ends match: True
    • Produce Scaffolds: True
  5. Assembled consensus sequences were filtered for contigs \(\ge\) 1000 bp for downstream applications.

Normally the assembly table shown above would be inserted here in the project notebook. Not included in this example for clarity.
###########

Four endophyte strains were assembled from paired-end raw sequence reads using the Geneious Prime software. The trimmed and normalized read files were assembled using the native Geneious Prime assembler. The draft genome-size and GC content for each strain appeared reasonable when compared to the whole genome-assemblies from members of the same genus.

Imported files: ~/raw_data/Apple-Bio/GENEWIZ-files/00_fastq/*.fastq.gz
- Where * is unique file identifier
Output to: ~/output/GP-assembly/<strain>-Conseq.fasta

2.3 Quality of assembly with CheckM (KBASE)

An additional check of the assembly quality can be performed using the CheckM application to assess genome coverage and screen for sequence contamination.

This online application is part of the KBASE platform of the US Department of Energy. You won’t be loading software onto your computer, but accessing their servers remotely. If this is your first time accessing KBASE you will need to create an account. The accounts are free and can be created with your uw.edu email address.

After creating an account, you should see a dashboard that looks something like this:

Thumbnail: click to enlarge

On the left you will see my Narratives, or project directories in KBASE.

To start a narrative, click the green + New Narrative button in the upper right corner. That will take you to this screen:

Thumbnail: click to enlarge

If you look at the top, you will see your new narrative title (untitled), which can be changed by you.

You can see three main sections: the upper left, where your data will be uploaded and stored; the lower left, with different apps for the data; and the main section, which, for now, has links for documentation and tutorials.

The tutorials will show you how to get your data from your local Project directories into your newly created narrative.

You need to import your <strain>-ConSeq.fasta files from your assemblies. If you are working with multiple strains, you can upload all of them into the same narrative at the same time.

General steps:

  1. Consensus files: <strain>-ConSeq.fasta
  2. Load data, define data type (i.e. Assembly), and import them into your narrative.
  3. Run the CheckM application once for each file.
  4. Record and analyze the output data.

CheckM can be found in the Apps panel at the bottom left of your dashboard. Click on Genome Assembly, scroll down to CheckM, and click it; the app will open an instance in your main panel. From that instance you can run the analysis on a file by selecting it from the pull-down menu. If you have multiple strains to analyze, repeat this, opening and running another instance of CheckM for each. Analysis will take less than 30 minutes, depending on server load.

There are no files to download with this app; actually, there are, but it is not worth the effort. We are only interested in two values:

  • Completeness (should be \(\ge\) 95%)
  • Contamination (should be \(\le\) 5%)

These numbers can be found on the CheckM TABLE tab on the report.

Thumbnail: click to enlarge

I just record them in my notebook. If the assemblies fail the CheckM metrics, that indicates problems with one of the prior steps - i.e., the DNA extraction, PCR amplification, sequencing, or assembly steps - and further analysis is necessary.
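The pass/fail decision reduces to the two thresholds above; a trivial sketch, handy if you keep the CheckM values in a table (Python, illustrative only):

```python
def checkm_pass(completeness, contamination, min_comp=95.0, max_contam=5.0):
    """High-quality draft assembly: completeness >= 95% and contamination <= 5%."""
    return completeness >= min_comp and contamination <= max_contam
```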

Example:

KBASE
Date accessed: 2021-10-11
CheckM version 1.0.18
Reference: (Parks et al. 2015)

The GP ConSeq assemblies, filtered \(\ge\) 1000 bp, were analyzed in CheckM for the quality of the assembly; i.e., completeness of coverage and contamination. Thresholds for high quality assemblies are \(\ge\) 95% completeness and \(\le\) 5% contamination.

All strains were found to have \(\gt\) 99% coverage with \(\lt\) 0.7% contamination (data not shown, but recorded in checkM-res.csv).

Submitted files: ~/output/GP-ConSeq/<strain>-ConSeq.fasta
Output directory/files: ~/r-nb/data/checkM.csv

You have now assembled a high-quality draft genome that is ready for downstream analyses. There are other protocols to further refine the genome, such as mapping the contigs to a reference genome, but this stage is sufficient for our needs.

2.4 Annotation of assembled draft genomes

Once the assemblies are complete and meet QC criteria, the next step is annotation. Annotation is the process of identifying the open reading frames (ORFs) and coding sequences (CDS). From this information, annotation protocols attempt to predict the protein-encoding, rRNA, and tRNA genes, and assign functions to those previously described. Using that information, different subsystems and pathways can be investigated.

This process depends on the quality of the assembly, the algorithms used, and how well the genes and/or species have been studied. For example, a high-quality assembly of a well described pathogen, such as Erwinia amylovora or Pseudomonas aeruginosa, will likely be more fully annotated than a novel strain of a less well studied species, like Rahnella aceris. Similarly, well studied genes are more likely to be correctly annotated, even if they occur in a novel strain.

Previously, annotations were completed manually, a laborious and time-consuming effort, requiring extensive expertise. Now, draft annotations are done computationally using various algorithms to identify candidate genes and results are generated in a fraction of the time. Just as with the assembly, there are a variety of different annotation protocols. In fact, it is not unusual to have a draft genome annotated by several programs to cover more of the genome.

We are going to start with a common annotation pipeline called the Rapid Annotation using Subsystem Technology (RAST). You will need your own account (free), and for me the process took several days - so you will want to plan ahead.

Once you have the account, the process is straightforward. You will begin by initiating a new job.

Thumbnail: click to enlarge

On the next screen, attach and upload your <strain>-ConSeq.fasta file and submit. Next, complete the strain metadata:

  • Taxonomy ID: blank
  • Taxonomy string: blank
  • Domain: bacteria
  • Genus: input genus
  • Species: sp.
  • Strain: strainID
  • Genetic code: 11 (Archaea, most Bacteria…)

and finally the RAST parameters:

  • RAST annotation scheme: RASTtk
  • Customize RASTtk pipeline: unchecked
  • Automatically fix errors: checked
  • Fix frame shifts: checked
  • Build metabolic model: checked
  • Turn on debug: checked
  • Set verbose level: 0
  • Disable replication: unchecked

Then submit the job. You will receive an email when it is complete.

Jobs can be found in Your Jobs > Jobs Overview. The annotation ID is the reference number that will appear on future downstream analyses, and you can click view details to go to the next screen.

Thumbnail: click to enlarge

On the next screen you will need to download three files to the ~/output/RAST directory in your project folder:

  • Amino-Acid FASTA file, as <strain>-aaRAST
  • Genbank, as <strain>-gbRAST
  • Spreadsheet (tab-separated text format), as <strain>-txtRAST

These files contain basically the same information, but in different formats, and will have the extensions .fasta, .gbk, and .txt, respectively.
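The tab-separated spreadsheet is convenient for quick tallies. The sketch below counts features by type; note that the "type" column name is an assumption - check the header row of your own download (Python, illustrative only):

```python
import csv, io

def tally_feature_types(tsv_text, type_col="type"):
    """Count annotated features by type from a tab-separated annotation table.
    The 'type' column name is assumed -- verify it against your RAST download."""
    counts = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        feature_type = row.get(type_col, "unknown")
        counts[feature_type] = counts.get(feature_type, 0) + 1
    return counts
```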

Finally, follow the Browse annotated genome in SEED Viewer link to record:

  • Number of subsystems
  • Number of Coding Sequences
  • Number of RNAs

For one or a few genomes, it is easiest to just record directly into the notebook, while for many strains a table may be more appropriate.

Ex: > Strain 3YPLD (Pantoea), Subsystems: 355, Coding seq.: 5094, RNAs: 88

The SEED viewer presents the annotation data in graphic and tabular formats, and you can search for genes and subsystems (e.g. protein secretion systems), do comparative analyses, view diagrams, and much more. At this point we are collecting data and will return to look at these features later, but feel free to explore.

You should now have three files saved and a few data points recorded.

Example:

Rapid Annotation using Subsystem Technology (RAST)
RAST version 2.0, upload date: 2021-08-02
Data stored indefinitely (no retention period specified) on the user’s jobs page
Reference: (Aziz et al. 2008)

Consensus sequence files (filtered for contigs \(\ge\) 1000 bp) were submitted for annotation.

The following options selected:

  • RAST annotation scheme: RASTtk
  • Customize RASTtk pipeline: unchecked
  • Automatically fix errors: checked
  • Fix frame shifts: checked
  • Build metabolic model: checked
  • Turn on debug: checked
  • Set verbose level: 0
  • Disable replication: unchecked
Table 2.3: RAST annotation results for the assembled genomes. Job numbers are recorded under my account and on file.
Strain CDS Subsystems RNAs
1SSLD 4768 337 89
2ALA1 5286 379 72
2PtLD 4978 376 97
2RDLD 5139 372 106
3RS3 5413 356 82
3ThS2 6273 379 62
3WL2 3237 297 72
3YPLB 5646 369 74
3YPLD 5094 355 88
4RDLA 4847 355 81
4RLE 5524 360 92

No summary.

Submitted files: ~/output/GP-ConSeq/<strain>-ConSeq.fasta
Output directory: ~/output/RAST/
Output files:

  • <strain>-aaRAST.fasta the amino-acid sequences by PEGs
  • <strain>-gbRAST.gbk GenBank format
  • <strain>-txtRAST.txt the full spreadsheet, incl. na, aa, protein, and PEGs

Identification

The final step in this tutorial is the identification of the strains. Traditionally, identification is done by sequencing the 16S ribosomal RNA gene. The 16S rRNA gene (~1500 bp) is present in all bacteria, is highly conserved due to its essential role in translation, is fast and easy to target and amplify, and is covered by a very large database of bacterial 16S rRNA sequences for phylogenetic analysis.

Quick Primer on 16S rRNA

However, because of the dynamic nature of bacterial genomes, and the fact that many species harbor multiple distinct copies of the 16S rRNA gene, this method of identification is considered reliable only to the genus level.

See: (Janda and Abbott 2007), particularly Issues section.

That does not mean that 16S rRNA sequencing is not useful. It is still widely used as a quick, inexpensive first-step identification when working with a new or unknown isolate. It is also widely used in microbiome analysis of mixed spp. samples.

Instead, we are going to use an identification method that utilizes the whole-genome sequences of our strains. The assembly data will be aligned against a database of type-strain genomes. Type strains, or reference strains, represent the original isolates used in species and subspecies descriptions for taxonomic classification, and they exhibit all of the relevant phenotypic and genotypic properties for the given species.

We are going to upload the assemblies to the Type (Strain) Genome Server (TYGS), which contains a comprehensive database, \(\gt\) 15k species and subspecies, of validly published type strains for whole-genome based methods for phylogeny and taxonomic classification. Like the previous steps, there are a variety of protocols for identification of species, using different methods.

Similar to the RAST annotation step, you will upload your annotated assembly to the TYGS server, which will complete the analysis and prepare data files for download.

Under the Submit your Job link, upload your annotated assembly (the RAST file); you can upload either the <strain>-aaRAST.fasta (amino-acid) or the <strain>-gbRAST.gbk file. Leave everything else at default (blank), enter your email, and submit. You will receive an email with a link to the results when the job is complete. Processing takes a few hours to a few days, depending on server load.

TYGS infers species/subspecies taxonomic classification and phylogenies using digital DNA:DNA hybridization (dDDH) of whole genomes against the type-strain. A dDDH d4 score of >70% is the threshold for classification to the species level. You can look at the information icon for more information on the dDDH categories.

Table 1 contains the phylogeny results. You can view phylogenetic trees based on both the 16S rRNA sequences and the whole-genome (WG) sequences.

Table 2 shows the results of the identification. It will report the strain as identified as species X, or as a “Potentially new species.”

Table 3 shows the dDDH scores with confidence intervals and %GC difference for the closest type-strain matches. The d4 score (\(\ge\) 70%) is used for identification.

Table 4 is the reference literature for the compared type-strain species.

There are five files to download. Save them in the ~/output/TYGS folder in the project drive.

  1. Table 1: download both “Newick” files for the strain. Save as <strain>-16Sphy.txt and <strain>-WGphy.txt.

  2. Table 3: download a .csv formatted file of the dDDH scores; <strain>-dDDH.csv

  3. Table 4: download a .csv formatted file of the reference literature; <strain>-ref.csv

  4. Top of the screen: use Download PDF Report to save a PDF version of the report as <strain>-TYGS.pdf. Besides the summaries, this file contains methods and results sections that can be used/modified for formal reports.
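The downloaded Newick files can be sanity-checked before importing them elsewhere. This sketch extracts leaf labels with a regular expression, assuming plain, unquoted labels without embedded parentheses or commas; `newick_leaf_names` is a hypothetical helper written for illustration:

```python
import re

def newick_leaf_names(newick):
    """Extract leaf labels from a Newick string.

    A leaf label immediately follows '(' or ',' and runs until the
    next ':', ',', ')' or ';'. Quoted or escaped labels are not
    handled by this simple sketch.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^():,;]+)", newick)]
```

A quick check that the expected strains and type strains appear as tips can catch a truncated or mislabeled download early.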

Note: Newick is a standard format that can be imported into a variety of apps (including Geneious Prime) to produce graphic versions of the phylogenetic trees. Example:

Type (Strain) Genome Server (TYGS)
TYGS v321, upload date: 2021-08-02
Data stored for 60 days on site under unique ID, downloadable
Reference: (Meier-Kolthoff et al. 2013)

The annotated assemblies were submitted to the Type (Strain) Genome Server (TYGS) for identification.

TYGS infers species/subspecies taxonomic classification and phylogenies using digital DNA:DNA hybridization (dDDH) of whole genomes against a type-strain database of \(\gt\) 15k spp.

A dDDH d4 score of >70% is the threshold for classification to the species level.

Table 2.4: TYGS results for type-strain matches to our strains.
Strain Species d4 StdDev Closest.Type.Strain Type.Strain.ID
1SSLD Erwinia sp. 22.5 +/-2.5 E. oleae DAPP-PG-531
2ALA1 Pseudomonas sp. 47.6 +/-2.6 P. wadenswierensis CCOS 864
2PtLD Serratia sp. 63.8 +/-2.9 S. nematodiphila DSM 21420
2RDLD Serratia marcescens 92.1 +/-1.7 S. marcescens sakuensis KCTC 42172
3RS3 Pantoea sp. 53.3 +/-2.7 P. endophytica 596
3ThS2 Pseudomonas sp. 39.9 +/-2.6 P. antarctica LMG22709
3WL2 Acinetobacter soli 88.7 +/-2.1 A. soli KCTC 22184
3YPLB Pseudomonas koreensis 80.1 +/-2.6 P. koreensis JCM 14769
3YPLD Pantoea sp. 53.4 +/-2.7 P. endophytica 596
4RDLA Erwinia sp. 22.0 +/-2.4 E. aphidicola JCM 21238
4RLE Pantoea agglomerans 71.7 +/-2.8 P. agglomerans NBRC 102470

Summary: Four of the strains, 2RDLD, 3WL2, 3YPLB, and 4RLE, were identified to the species level, while the remaining 7 strains, 1SSLD, 2ALA1, 2PtLD, 3RS3, 3ThS2, 3YPLD, and 4RDLA, were only matched to the genus level, indicating that they may represent novel species.
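The counts in that summary follow directly from applying the d4 \(\ge\) 70% species threshold to the scores in Table 2.4, which can be checked in a few lines:

```python
# d4 dDDH scores transcribed from Table 2.4 (strain: d4)
d4_scores = {
    "1SSLD": 22.5, "2ALA1": 47.6, "2PtLD": 63.8, "2RDLD": 92.1,
    "3RS3": 53.3, "3ThS2": 39.9, "3WL2": 88.7, "3YPLB": 80.1,
    "3YPLD": 53.4, "4RDLA": 22.0, "4RLE": 71.7,
}

SPECIES_THRESHOLD = 70.0  # dDDH d4 >= 70% => species-level match

species_level = sorted(s for s, v in d4_scores.items() if v >= SPECIES_THRESHOLD)
genus_only = sorted(s for s, v in d4_scores.items() if v < SPECIES_THRESHOLD)
```

This reproduces the four species-level identifications (2RDLD, 3WL2, 3YPLB, 4RLE) and the seven genus-level matches.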

Erwinia. Two strains, 1SSLD and 4RDLA, were classified as Erwinia, although neither was matched to the species level. Phylogenetically, strain 1SSLD was most closely related to E. oleae (d4 = 22.5 ±2.5) and 4RDLA to E. aphidicola (22.0 ±2.4), although the dDDH values classified both as species distinct from their nearest matches, as well as from the other members of the genus Erwinia, including the pathogens E. amylovora and E. pyrifoliae and the non-pathogen E. tasmaniensis.

3 Final Notes

That completes the assembly, annotation, and identification protocols for a newly isolated strain. Next will be the genetic characterization, where we will mine the genomes for genes and systems of interest, which will be considerably more challenging and interesting.

Remember to document what you do. Like all protocols, these are the standard steps, and any modification or deviation should be thoroughly documented. Also, recall that your notebooks are legal documents that may be reviewed in the future, whether by supervisors, colleagues, collaborators, reviewers, or other scientists pre- and post-publication.

Original raw data should not be altered. Any changes, formatting, or additions should be done in RStudio. This includes Excel spreadsheets, downloaded files, etc., and is especially important for data provided to you by other people or sources.

It is particularly important to record all software and application versions and the dates accessed. If you submit data over several dates, or resubmit data, then note that in the dates accessed section. Include what was submitted when, and if resubmitted why it was done.

Save all files carefully and in their correct folder. Stick with the naming conventions as much as possible. This provides clarity for future researchers (and yourself in 6 months) that may be revisiting the notebook.

Remembering to download data files is particularly important because many of the web platforms we use retain data for only a limited time. This is a bit of a pain, but a fair trade for accessing the applications free of charge. Some of the graphical output can be reproduced from the downloaded data files (e.g. annotations, phylogenetic trees), but in some cases it is important to download copies of any graphics you want to have access to in the future.

Finally, relax! One of the benefits of bioinformatics research is that you can’t ruin the experiment! Unlike forgetting to water the plants or using the wrong culture media, archived copies and good reproducibility practices keep the original data intact. You may not get the correct results, or you may corrupt your local copy of the data, but you can always go back one or more steps if needed. This gives researchers the ability to explore different or redundant methods on the same data set, roughly equivalent to adding experimental treatments or having multiple lines of evidence to support your hypotheses.


References

Aziz R. K., D. Bartels, A. A. Best, M. DeJongh, T. Disz, et al., 2008 The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics 9: 75. https://doi.org/10.1186/1471-2164-9-75
Carriço J. A., M. Rossi, J. Moran-Gilad, G. Van Domselaar, and M. Ramirez, 2018 A primer on microbial bioinformatics for nonbioinformaticians. Clinical Microbiology and Infection 24: 342–349. https://doi.org/10.1016/j.cmi.2017.12.015
Forouzan E., M. S. M. Maleki, A. A. Karkhane, and B. Yakhchali, 2017 Evaluation of nine popular de novo assemblers in microbial genome assembly. Journal of Microbiological Methods 143: 32–37. https://doi.org/10.1016/j.mimet.2017.09.008
Hardoim P. R., L. S. van Overbeek, G. Berg, A. M. Pirttilä, S. Compant, et al., 2015 The hidden world within plants: Ecological and evolutionary considerations for defining functioning of microbial endophytes. Microbiology and Molecular Biology Reviews 79: 293–320. https://doi.org/10.1128/MMBR.00050-14
Janda J. M., and S. L. Abbott, 2007 16S rRNA Gene Sequencing for Bacterial Identification in the Diagnostic Laboratory: Pluses, Perils, and Pitfalls. Journal of Clinical Microbiology 45: 2761–2764. https://doi.org/10.1128/JCM.01228-07
Lucaciu R., C. Pelikan, S. M. Gerner, C. Zioutis, S. Köstlbacher, et al., 2019 A Bioinformatics Guide to Plant Microbiome Analysis. Front. Plant Sci. 10: 1313. https://doi.org/10.3389/fpls.2019.01313
Meier-Kolthoff J. P., A. F. Auch, H.-P. Klenk, and M. Göker, 2013 Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14: 60. https://doi.org/10.1186/1471-2105-14-60
Parks D. H., M. Imelfort, C. T. Skennerton, P. Hugenholtz, and G. W. Tyson, 2015 CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25: 1043–1055. https://doi.org/10.1101/gr.186072.114
Reinhold-Hurek B., and T. Hurek, 2011 Living inside plants: bacterial endophytes. Current Opinion in Plant Biology 14: 435–443. https://doi.org/10.1016/j.pbi.2011.04.004
Rizzi R., S. Beretta, M. Patterson, Y. Pirola, M. Previtali, et al., 2019 Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quantitative Biology 7: 278–292. https://doi.org/10.1007/s40484-019-0181-x
Segerman B., 2020 The Most Frequently Used Sequencing Technologies and Assembly Methods in Different Time Segments of the Bacterial Surveillance and RefSeq Genome Databases. Front. Cell. Infect. Microbiol. 10: 527102. https://doi.org/10.3389/fcimb.2020.527102